
feat: working clean/smudge round-trip for a Rust subset#26

Merged
bdelanghe merged 10 commits into main from claude/git-ast-architecture-tdhbqs on Jun 26, 2026

Conversation

@bdelanghe

Collaborator

What this does

Turns the filter skeleton into a real, git-invoked AST round-trip. With the filter installed, git add stores Rust in canonical form and git checkout returns it — so reformatting never enters history.

git add      →  clean:  your .rs ──Tree-sitter parse──▶ tree ──printer──▶ canonical bytes (stored)
git checkout →  smudge: stored bytes ──▶ working file   (identity; already canonical)

Before this PR the filter was a no-op that didn't even speak git's protocol; the only "round-trip" was a string-prefix marker in a unit test.
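
The clean/smudge split in the diagram above amounts to a small dispatch on direction and file type. The sketch below is a minimal illustration under stated assumptions, not the PR's code: `Direction`, `filter_blob`, and `toy_canonicalize` are hypothetical names, and the toy canonicalizer only trims trailing whitespace, where the real printer re-emits source from a Tree-sitter parse.

```rust
use std::path::Path;

/// Which direction git is asking the filter to run (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Direction {
    Clean,  // working tree -> object store (git add)
    Smudge, // object store -> working tree (git checkout)
}

/// Stand-in for the real printer: trims trailing whitespace on each line and
/// ends the output with a newline. It does NOT parse Rust.
fn toy_canonicalize(src: &str) -> Result<String, String> {
    let mut out = String::new();
    for line in src.lines() {
        out.push_str(line.trim_end());
        out.push('\n');
    }
    Ok(out)
}

/// Clean canonicalizes `*.rs`; smudge and non-Rust paths return the bytes unchanged.
pub fn filter_blob(dir: Direction, pathname: &str, content: &[u8]) -> Result<Vec<u8>, String> {
    let is_rust = Path::new(pathname)
        .extension()
        .map_or(false, |ext| ext == "rs");
    match (dir, is_rust) {
        (Direction::Clean, true) => {
            // Fail closed: input that cannot be handled is an error, not a pass-through.
            let src = std::str::from_utf8(content).map_err(|e| e.to_string())?;
            Ok(toy_canonicalize(src)?.into_bytes())
        }
        _ => Ok(content.to_vec()),
    }
}
```

Because smudge is the identity, the working file after checkout is whatever clean stored, which is why reformatting cannot reach history.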

Changes

  • printer.rs — the AST-native core. Parses Rust with Tree-sitter and re-emits canonical source by walking the tree. Fail-closed: syntax errors reject the commit; any unsupported node kind returns an error rather than silently corrupting code.
  • pktline.rs — codec for git's long-running filter pkt-line framing.
  • filters.rs — implements the real filter-process protocol (handshake → capabilities → per-blob). clean canonicalizes *.rs, smudge is identity, non-Rust passes through.
  • setup.rs / git-ast setup — one command to register the filter + .gitattributes in a repo (idempotent).
  • examples/demo.sh: end-to-end proof that a pure reformat produces no diff, while a real change from a + b to a - b shows a clean one-line diff.
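
For readers unfamiliar with pkt-line framing: each packet is four hex digits giving the total length (header included) followed by the payload, and the special packet `0000` is a flush that ends a list or a content stream. The following is a minimal sketch of such a codec, assuming only the standard library; the function names are illustrative and are not taken from pktline.rs.

```rust
use std::io::{self, Read, Write};

/// Largest payload one pkt-line may carry (65520 bytes total minus the 4-byte header).
const MAX_PAYLOAD: usize = 65516;

/// Write one data packet: 4 lowercase hex digits for the total length, then the payload.
pub fn write_packet<W: Write>(out: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_PAYLOAD {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    write!(out, "{:04x}", payload.len() + 4)?;
    out.write_all(payload)
}

/// Write the flush packet that terminates a list or a content stream.
pub fn write_flush<W: Write>(out: &mut W) -> io::Result<()> {
    out.write_all(b"0000")
}

/// Read one packet. Returns Ok(None) for a flush packet.
pub fn read_packet<R: Read>(input: &mut R) -> io::Result<Option<Vec<u8>>> {
    let bad = |msg: &'static str| io::Error::new(io::ErrorKind::InvalidData, msg);
    let mut header = [0u8; 4];
    input.read_exact(&mut header)?;
    let text = std::str::from_utf8(&header).map_err(|_| bad("non-hex length"))?;
    let len = usize::from_str_radix(text, 16).map_err(|_| bad("non-hex length"))?;
    match len {
        0 => Ok(None),
        // Lengths 1..=3 cannot include their own header; this sketch rejects them.
        1..=3 => Err(bad("length below header size")),
        _ => {
            let mut payload = vec![0u8; len - 4];
            input.read_exact(&mut payload)?;
            Ok(Some(payload))
        }
    }
}
```

Blob content larger than `MAX_PAYLOAD` has to be split across several packets by the caller; this sketch leaves that chunking out.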

Verified

18 tests pass; cargo fmt --check and cargo clippy --all-targets -- -D warnings are clean. Demonstrated through real git add/git checkout/git diff in a throwaway repo (see examples/demo.sh).

Scope — deliberately honest

  • One language, a documented subset of it (functions, params, blocks, let, binary/call/macro expressions, literals, comments). Widening coverage is additive — one arm per node kind.
  • Diff and merge drivers remain placeholders. Making those structural depends on the hardest open problem — stable AST node identity across versions — which this PR does not address. Canonical formatting removes formatting churn from history; it does not track a node through a move or rename. See docs/planning/scope.md.
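
The "one arm per node kind" shape can be pictured as a match that errors on anything it does not recognize. This is a sketch over a hypothetical miniature AST (`Node`, `PrintError`, and `emit` are illustrative names); the PR's printer walks Tree-sitter nodes instead.

```rust
/// Hypothetical miniature AST standing in for Tree-sitter nodes.
#[derive(Debug)]
pub enum Node {
    Int(i64),
    Ident(String),
    Binary { op: char, lhs: Box<Node>, rhs: Box<Node> },
    /// Stands for any node kind the printer has no arm for yet.
    Other(&'static str),
}

#[derive(Debug, PartialEq)]
pub enum PrintError {
    Unsupported(&'static str),
}

/// Emit canonical text, or fail closed on the first unsupported kind.
pub fn emit(node: &Node, out: &mut String) -> Result<(), PrintError> {
    match node {
        Node::Int(n) => out.push_str(&n.to_string()),
        Node::Ident(name) => out.push_str(name),
        Node::Binary { op, lhs, rhs } => {
            // Canonical spacing: exactly one space on each side of the operator.
            emit(lhs, out)?;
            out.push(' ');
            out.push(*op);
            out.push(' ');
            emit(rhs, out)?;
        }
        // Supporting a new kind means adding an arm above, not loosening this one.
        Node::Other(kind) => return Err(PrintError::Unsupported(*kind)),
    }
    Ok(())
}
```

On error, `out` may hold partial text, so a caller following the fail-closed rule would discard it and reject the blob.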

🤖 Generated with Claude Code



claude added 10 commits June 25, 2026 20:00
Turn the filter skeleton into a real, git-invoked AST round-trip:

- printer: parse Rust with Tree-sitter and re-emit canonical source by
  walking the tree. Fail-closed — syntax errors and unsupported node
  kinds error rather than silently corrupting code.
- pktline: implement Git's long-running filter pkt-line codec.
- filters: speak the real `filter-process` protocol; `clean`
  canonicalizes `*.rs`, `smudge` is identity, non-Rust passes through.
- setup: `git-ast setup` registers the filter + .gitattributes in a repo.
- examples/demo.sh: end-to-end proof that reformatting produces no diff
  while a real change shows a clean one.

Scope stays honest: one language, a documented subset, fail-closed.
Diff/merge drivers remain placeholders pending stable node identity,
which this does not address.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
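
An idempotent setup step mostly comes down to "add the rule only if it is missing". A minimal sketch of that check for the `.gitattributes` half follows; `ensure_rule` is a hypothetical helper and the filter name `ast` in the usage note is an assumption, since the PR text does not state the registered name. Registering the driver itself would go through git's `filter.<driver>.process` config key.

```rust
/// Return `existing` with `rule` present exactly once.
/// Hypothetical helper; operates on file contents so it is easy to test.
pub fn ensure_rule(existing: &str, rule: &str) -> String {
    // Already present: return the contents untouched so reruns are no-ops.
    if existing.lines().any(|line| line.trim() == rule) {
        return existing.to_string();
    }
    let mut out = existing.to_string();
    // Avoid gluing the rule onto an unterminated last line.
    if !out.is_empty() && !out.ends_with('\n') {
        out.push('\n');
    }
    out.push_str(rule);
    out.push('\n');
    out
}
```

Calling `ensure_rule(contents, "*.rs filter=ast")` twice yields the same text as calling it once, which is the property a rerunnable setup command needs.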
…sport)

Add a README section framing the hard, deferred problem precisely: node
identity is heuristic not exact, computed by tree-matching rather than
stored, helped by content-addressed subtree hashing, and git notes are a
transport for attribution across rewrites — not the identity mechanism.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
frond exercises the same parse -> regenerate -> compare primitive for
JavaScript/TypeScript (SWC on Deno) that git-ast does for Rust
(Tree-sitter). Cross-link them: frond validates round-trip fidelity, the
prerequisite git-ast's canonical printer depends on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
Unison makes identity = hash of the normalized AST a language primitive,
giving rename/move stability for free. Note the two honest caveats: it
does not dissolve identity through an edit (namespace history records the
succession), and it is greenfield where git-ast must retrofit the same
property onto git + mainstream languages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
Capture the "how" that makes node identity tractable: split into a
content-addressed model store (the AST, identity recorded) and a text
projection store (canonical source, what humans/git/CI see), kept in
lockstep by the bidirectional transform.

Use Dolt for the model store: AST as keyed rows, prolly-tree cell-level
merge, and per-node attribution via dolt blame as a primitive. Honest
boundaries: Dolt removes the plumbing not the semantics (you still define
the keys = node identity), two heterogeneous stores carry a lockstep
invariant, and conflicts move to cell-level rather than vanishing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
Reframe the node-identity section around prior art and a defensible
position:

- Identity is a vector, not a scalar (cf. Kythe VName). Split the
  dimensions that aren't atomic: content shallow vs deep (Merkle),
  name lexeme vs binding, definition vs use/call (export surface as
  contract), location. Note dimensions differ in epistemic cost, and
  that content equivalence over-merges clones (equivalence != persistence).
- Three families for establishing correspondence: by construction
  (CRDT TreeId / Unison hash), by operation (RefactoringMiner, CodeShovel),
  by snapshot matching (GumTree, the fallback).
- Thesis: record the edit, don't reconstruct it — agent-authored code can
  emit edit provenance, making identity durable by construction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
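
The shallow versus deep (Merkle) distinction for the content dimension can be made concrete with a toy tree. This is a sketch under assumptions: the `Node` type is hypothetical, and FNV-1a is used only because it is tiny and deterministic, not because the project uses it.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Fold `bytes` into a running 64-bit FNV-1a state.
fn fnv1a(bytes: &[u8], mut hash: u64) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical AST node: a kind, a lexeme (empty for interior nodes), children.
pub struct Node {
    pub kind: &'static str,
    pub text: String,
    pub children: Vec<Node>,
}

/// Hash of the node's own kind and lexeme, separated by a zero byte.
fn own_hash(n: &Node) -> u64 {
    let h = fnv1a(n.kind.as_bytes(), FNV_OFFSET);
    fnv1a(n.text.as_bytes(), fnv1a(&[0], h))
}

/// Shallow: the node itself plus the kinds of its direct children only.
pub fn shallow_hash(n: &Node) -> u64 {
    let mut h = own_hash(n);
    for c in &n.children {
        h = fnv1a(c.kind.as_bytes(), fnv1a(&[0], h));
    }
    h
}

/// Deep (Merkle): folds in each child's deep hash, so any change in the
/// subtree changes the hash of every ancestor.
pub fn deep_hash(n: &Node) -> u64 {
    let mut h = own_hash(n);
    for c in &n.children {
        h = fnv1a(&deep_hash(c).to_le_bytes(), h);
    }
    h
}
```

Two structurally identical subtrees get the same deep hash wherever they appear, which is the clone over-merging noted above: equal content does not by itself say the two nodes are the same node over time.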
- printer: add convergence, idempotence, and purity property tests, plus
  a "Determinism contract" doc section (convergent, idempotent, no ambient
  nondeterminism, fail-closed; canonical form versioned by grammar+printer).
- README: surface the determinism/idempotence guarantee in status, and add
  a "provenance pipeline" section grounding each identity form (content
  shallow/deep, name lexeme/binding, def-vs-use, location, operation,
  authorship) in a have-today vs what's-needed table with prior art. Thesis:
  capture provenance as early as possible (the agent sits at stage 1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
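
The convergence, idempotence, and purity properties named in this commit can be stated as one small checker. The sketch below is illustrative only: `check_contract` and `toy_canon` are hypothetical names, and the toy canonicalizer collapses whitespace runs (which would be wrong inside string literals) purely to give the checker something to exercise.

```rust
/// Check a canonicalizer against a set of formattings of the same program.
pub fn check_contract<F>(canon: F, variants: &[&str]) -> Result<(), String>
where
    F: Fn(&str) -> Result<String, String>,
{
    let mut outputs = Vec::new();
    for &src in variants {
        let once = canon(src)?;
        // Purity: the same input gives the same output on a second call.
        if canon(src)? != once {
            return Err(format!("impure on {src:?}"));
        }
        // Idempotence: canonical output is a fixed point.
        if canon(&once)? != once {
            return Err(format!("not idempotent on {src:?}"));
        }
        outputs.push(once);
    }
    // Convergence: every formatting maps to the same bytes.
    if outputs.windows(2).any(|pair| pair[0] != pair[1]) {
        return Err("not convergent".to_string());
    }
    Ok(())
}

/// Toy stand-in for the printer: joins whitespace-separated tokens with one space.
pub fn toy_canon(src: &str) -> Result<String, String> {
    Ok(src.split_whitespace().collect::<Vec<_>>().join(" "))
}
```

An identity function passes the purity and idempotence checks but fails convergence, which is why all three properties are tested rather than idempotence alone.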
A cucumber-rs suite (tests/features/claims.feature) drives real git with the
built binary as the clean/smudge filter, verifying end to end: reformatting
shows no diff, a real change does, different formattings store byte-identical
blobs, checkout round-trips to canonical source, syntax errors are rejected
(fail-closed), and non-Rust files pass through unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
Clarify the operational split: clean is the right place to parse/emit the
content-addressed AST (pure, deterministic), but clean/smudge are the wrong
place to write the model store (no commit context; filters run during diff/
stash/archive/checkout). Stateful model-store writes belong in commit/ref
hooks (post-commit, post-rewrite, post-checkout, post-receive). git stores
what is reparseable (canonical text); the model store stores what is not
(operation identity, who/which-agent authorship).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
Small proof-of-concept of the first verbspec read verb — "look at the AST":
`git-ast inspect [FILE]` parses Rust and lists top-level definitions, each
tagged with a deterministic content hash over canonical form, so identity is
invariant under reformatting. Backed by `printer::inspect` + tests.

README: add "The interface: verbs (verbspec)" — read verbs (inspect/find/
blame) and write verbs (rename/extract/generate = stage-1 provenance capture),
with verbspec as the author-once-project-everywhere delivery vehicle that puts
an agent at stage 1. Link verbspec under related projects.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NCp6PSoWKvsbFWyav6CeeC
bdelanghe marked this pull request as ready for review June 26, 2026 17:07
bdelanghe merged commit d5b68f3 into main Jun 26, 2026
1 check passed